Introduction

General information

The exam consists of 8 parts in which you are asked to analyze different datasets. Each part focuses on a different dataset. The datasets are included in different R packages, and you need to install these packages to access the data. Your analysis should be done in R, and your answers should be given as R code. For example, if the question is

Question 0 (Example)

  1. Draw a random sample of size 100 from N(0,1).
  2. Produce a histogram for the sample.

Your solution should be

Solution for question 0.1:

x<-rnorm(100,0,1)

Solution for question 0.2:

hist(x)

You do not need to explain your R code. For example, you do not need to write: “the function hist() was used to produce the histogram.” Your answers to the questions should be the R code that you used to produce the output.

What do you need to submit as a solution for the exam?

You need to submit the following materials:

  1. R markdown program that can be used to conduct the analysis.
  2. PDF file version of the solution (produced using the R markdown program).
  3. HTML file version of the solution (produced using the R markdown program).
  4. In Question 8, you are asked to produce a presentation using R markdown. For this question you need to submit:
    • R markdown program that was used to produce the presentation.
    • PDF file of the presentation.
  5. In Question 10.3, include the png figure in the zip file that you submit.
  6. In Question 14.2, include the Excel file that you need to create in the zip file that you submit.

What do you not need to write?

You do not need to interpret the results! For example, if the question is to fit a one-way ANOVA model, you do not need to formulate the model or interpret the results. This means, for example, that you do not need to write “the p-value is 0.007, indicating a significant effect of the factor.”

When do you need to submit the solution?

  • Date: 12/01/2024.
  • Time: 17:00.

How to submit the solution?

You will need to upload your solution to BB. You will receive information about the submission by email.

Second part of the exam

The second part of the exam will take place (online) on 15/01/2024 from 08:30 to 11:30.

Oral exam

The oral exam will take place on 16/01/2024 and 17/01/2024. The schedule is available online in BB.

Part 1: the real_data_GDI data

In this part of the exam, the questions are focused on the real_data_GDI dataset which is a part of the genderstat R package. To access the data you need to install the package. More information can be found on https://cran.r-project.org/web/packages/genderstat/index.html. Use the code below to access the data.

library(genderstat)
data("real_data_GDI")
names(real_data_GDI)
## [1] "country"                "female_life_expectancy" "male_life_expectancy"  
## [4] "female_mean_schooling"  "male_mean_schooling"    "female_gni_per_capita" 
## [7] "male_gni_per_capita"

Question 1

  1. How many countries are included in the data? Count missing values in each variable in the data.

  2. Create a new data frame without the missing data. How many countries are left in the data?

  3. Calculate the minimum and maximum of the variables life expectancy, mean schooling and GNI per capita for both males and females.

  4. For each gender, sort the life expectancy of all countries from the highest to the lowest, and print the top country.

  5. For each gender, print the 15 countries with the highest life expectancy.

  6. How many countries have both female and male life expectancy higher than 80?

  7. Show the countries counted in Q1.6.

Solution 1.1

Solution 1.2

Solution 1.3

Solution 1.4

Solution 1.5

Solution 1.6

Solution 1.7

Question 2

In this question, we use the dataset that was created in Q1.2 (the dataset without the missing values).

  1. Define a new categorical variable flife_cat in the following way: Re-code the variable female_life_expectancy into three categories:

    female_life_expectancy <60: Low.
    female_life_expectancy 60-80: Medium.
    female_life_expectancy >80: High.

Count how many countries are included in each category.

  2. Produce the pie plot and the barplot in a single figure with one row of two panels, as presented in Figure 2.1.

Figure 2.1

  3. Define a new dataset that includes the countries in which females are classified as having low life expectancy. Sort the data by male life expectancy and print the top 3 countries.

  4. For the dataset in Q2.3, calculate the mean and standard deviation of male life expectancy and produce the output below.

##   mean_expectancy std_expectancy
## 1           52.65       2.562616
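As a hedged illustration of the three-category recoding described in this question, the sketch below uses a small hypothetical vector (`life`, not taken from real_data_GDI). It uses nested `ifelse()` calls so that the boundary values 60 and 80 fall in the Medium category, as the category definitions state; the object names are assumptions made for the example.

```r
# Toy life-expectancy values (hypothetical, not taken from real_data_GDI)
life <- c(55.2, 60, 72.5, 80, 81.3, 58.9)

# Nested ifelse() keeps the boundaries exactly as stated:
# < 60 is Low, 60 to 80 (inclusive) is Medium, > 80 is High
life_cat <- ifelse(life < 60, "Low",
                   ifelse(life <= 80, "Medium", "High"))

# Store as a factor so the categories keep a meaningful order
life_cat <- factor(life_cat, levels = c("Low", "Medium", "High"))

# Count the observations in each category
counts <- table(life_cat)
print(counts)
```

A call such as `cut()` would also work, but its default right-closed intervals would place the value 60 in the lowest category, so the break points would need care.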

Solution 2.1

Solution 2.2

Solution 2.3

Solution 2.4

Question 3

In this question we use the real_data_GDI dataset without the missing values.

  1. Create a new data frame for countries in which male life expectancy is higher than 53. How many countries are included in the new dataset?

  2. For the new dataset, calculate a 95% confidence interval for the female life expectancy using a standard normal distribution. Note that you need to program the formula for the confidence interval yourself.

  3. Write a function that receives a numerical vector and returns, as numerical output, the 95% confidence interval and the mean of the vector. Inside your function, use the R function t.test() to calculate the confidence interval and the mean. Apply this function to female life expectancy in the new dataset defined in Q3.1.

  4. Use the R package interpretCI (and the meanCI() function) to calculate the confidence interval for the female life expectancy using a standard normal distribution in the new dataset defined in Q3.1.
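The two confidence-interval approaches mentioned in this question can be sketched on a toy vector. The values in `x` are hypothetical, and `mean_ci` is a name chosen for this example only. The first part programs the normal-based formula, mean ± z · sd / √n, by hand; the second wraps `t.test()`, which uses the t distribution and therefore gives a slightly wider interval for small samples.

```r
# Toy sample (hypothetical values, not taken from real_data_GDI)
x <- c(70.1, 72.4, 68.9, 75.3, 71.0, 69.8, 73.6, 74.2)

# Normal-based 95% CI programmed by hand: mean +/- z * sd / sqrt(n)
n    <- length(x)
z    <- qnorm(0.975)
se   <- sd(x) / sqrt(n)
ci_z <- c(lower = mean(x) - z * se, upper = mean(x) + z * se)

# Wrapper around t.test() returning the mean and the t-based 95% CI
# (mean_ci is a hypothetical name)
mean_ci <- function(v) {
  tt <- t.test(v, conf.level = 0.95)
  c(mean = unname(tt$estimate), lower = tt$conf.int[1], upper = tt$conf.int[2])
}

ci_t <- mean_ci(x)
print(ci_z)
print(ci_t)
```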

Solution 3.1

Solution 3.2

Solution 3.3

Solution 3.4

Question 4

In this question, we use the real_data_GDI dataset without the missing values.

  1. Produce the scatter plot in Figure 4.1. Note that the countries identified on the plot are all classified as having low female life expectancy.

Figure 4.1

  2. Produce the scatter plot in Figure 4.2.

Figure 4.2

  3. Calculate the correlation between the variables female_mean_schooling and female_life_expectancy using the R function cor.test().

  4. Fit a linear regression model with the mean schooling for females as the predictor and the life expectancy for females as the dependent variable. Print only the coefficients panel (coefficients, standard errors, t values and p values).

  5. Produce a scatter plot of female_mean_schooling vs. female_life_expectancy, and add a regression line as shown in Figure 4.3.

Figure 4.3
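A minimal sketch of the correlation test and the coefficients panel asked for in this question, run on hypothetical toy vectors (`schooling` and `life` are made-up values, not the exam data). It assumes that "coefficients panel" refers to the `coefficients` matrix stored in the object returned by `summary()` for an `lm` fit.

```r
# Toy data (hypothetical): years of schooling and life expectancy
schooling <- c(2.1, 4.5, 6.0, 7.8, 9.2, 10.5, 12.1, 13.0)
life      <- c(55.0, 60.2, 63.1, 68.4, 70.9, 74.3, 79.8, 81.5)

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(schooling, life)
print(ct$estimate)

# Simple linear regression of life expectancy on schooling
fit <- lm(life ~ schooling)

# Only the coefficients panel: estimate, std. error, t value, p value
coef_panel <- summary(fit)$coefficients
print(coef_panel)
```

With a fitted model in hand, a call such as `abline(fit)` after `plot(schooling, life)` is one way to add the regression line to a base R scatter plot.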

Solution 4.1

Solution 4.2

Solution 4.3

Solution 4.4

Solution 4.5

Part 2: the flying data

For the analysis in this part we use the flying data, which is part of the R package dropout. This is a modified version of the Flying Etiquette Survey data. More information can be found at https://CRAN.R-project.org/package=dropout. The code below can be used to access the data.

library(dropout)
data("flying")
names(flying)
##  [1] "respondent_id"               "travel_frequency"           
##  [3] "seat_recline"                "height"                     
##  [5] "children_under_18"           "two_armrests"               
##  [7] "middle_armrest"              "window_shade"               
##  [9] "moving_to_unsold_seat"       "talking_to_seatmate"        
## [11] "getting_up_on_6_hour_flight" "obligation_to_reclined_seat"
## [13] "recline_seat_rudeness"       "eliminate_reclining_seats"  
## [15] "switch_for_friends"          "switch_for_family"          
## [17] "wake_passenger_bathroom"     "wake_passenger_walk"        
## [19] "baby_on_plane"               "unruly_children"            
## [21] "electronics_violation"       "smoking_violation"          
## [23] "gender"                      "age"                        
## [25] "household_income"            "education"                  
## [27] "location_census_region"      "survey_type"

Question 5

  1. Remove the missing values from the data. How many observations remain in the data?

  2. For the rest of question 5 we use the flying data without the missing values. Produce the data frame shown below, which shows the number of respondents for each age and gender category.

##     age gender   n
## 1 18-29 Female  75
## 2 18-29   Male  62
## 3 30-44 Female  78
## 4 30-44   Male  95
## 5 45-60 Female  95
## 6 45-60   Male 108
## 7  > 60 Female  87
## 8  > 60   Male  77
  3. Produce the box plot in Figure 5.1.

Figure 5.1

  4. Use a barplot to visualize the distribution of gender across the factor levels of age, as shown in Figure 5.2.

Figure 5.2

  5. Produce the plot in Figure 5.3.

Figure 5.3

Solution 5.1

Solution 5.2

Solution 5.3

Solution 5.4

Solution 5.5

Question 6

In this question, we use the flying data without the missing values.

  1. Produce the data frame below.
##   gender       baby_on_plane   n percentage
## 1 Female No, not at all rude 255  76.119403
## 2 Female  Yes, somewhat rude  58  17.313433
## 3 Female      Yes, very rude  22   6.567164
## 4   Male No, not at all rude 214  62.573099
## 5   Male  Yes, somewhat rude  89  26.023392
## 6   Male      Yes, very rude  39  11.403509
  2. Produce the plot in Figure 6.1.

Figure 6.1

  3. Tabulate the distribution of the respondents’ answers (for each gender and age group) to the question “is it rude to bring a baby on a plane?”.

  4. Produce the plot in Figure 6.2.

Figure 6.2

  5. Produce the plot in Figure 6.3, which shows the distribution of the male respondents’ answers to 5 questions:
    • “in general, is it rude to bring a baby on a plane?” (baby_on_plane)
    • “is it rude to ask someone to switch seats with you in order to be closer to family?” (switch_for_family)
    • “is it rude to move to an unsold seat on a plane?” (moving_to_unsold_seat)
    • “generally speaking, is it rude to say more than a few words to the stranger sitting next to you on a plane?” (talking_to_seatmate)
    • “is it rude to wake a passenger up if you are trying to walk around?” (wake_passenger_walk)

Figure 6.3

Solution 6.1

Solution 6.2

Solution 6.3

Solution 6.4

Solution 6.5

Question 7

In this question we focus on the flying data without missing values.

  1. We focus on the variables gender and baby_on_plane. Produce the \(2 \times 3\) table shown below.
##         baby_on_plane
## gender   No, not at all rude Yes, somewhat rude Yes, very rude
##   Female                 255                 58             22
##   Male                   214                 89             39
  2. Use a chi-square test to test the hypothesis that gender and baby_on_plane are independent.

  3. Define an R object for the test statistic, plot the density of the test statistic under the null hypothesis, and add a line for the observed test statistic.
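The chi-square workflow described in this question can be sketched on a hypothetical 2×2 table (the counts below are made up and are not the exam data). The sketch stores the observed statistic in an R object, then draws the chi-square density for the corresponding degrees of freedom and marks the observed value; the object names `stat` and `dof` are assumptions for this example.

```r
# Toy 2x2 table of counts (hypothetical, not the exam data)
tab <- matrix(c(30, 20,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              answer = c("No", "Yes")))

# Chi-square test of independence (no continuity correction,
# so the statistic is the plain Pearson chi-square)
res  <- chisq.test(tab, correct = FALSE)
stat <- unname(res$statistic)   # observed test statistic
dof  <- unname(res$parameter)   # degrees of freedom: (2 - 1) * (2 - 1)

# Density of the chi-square distribution under H0,
# with a vertical line at the observed statistic
curve(dchisq(x, df = dof), from = 0.01, to = max(15, stat + 5),
      xlab = "Test statistic", ylab = "Density")
abline(v = stat, col = "red", lty = 2)
```

For tables larger than 2×2, `chisq.test()` applies no continuity correction, so the `correct` argument has no effect there.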

Solution 7.1

Solution 7.2

Solution 7.3

Question 8

Prepare a presentation of 5-10 slides using R markdown about the connection between the gender and the variable baby_on_plane. Make sure that your presentation includes:

  • A Title slide.
  • At least one slide with text.
  • At least one slide with a figure.
  • At least one slide with text and a figure.

Please note that you WILL NOT be asked to give the presentation and you WILL NOT be asked questions about the presentation. Your aim in this question is to demonstrate that you know how to use R markdown to make a presentation about your analysis. More details on how to make a presentation using R markdown: https://rmarkdown.rstudio.com/lesson-11.html.

Part 3: the unemp data

In this part of the exam, the questions are focused on the unemp dataset which is a part of the viridis R package. To access the data you need to install the package. More information can be found in https://cran.r-project.org/web/packages/viridis/viridis.pdf. Use the code below to access the data.

library(viridis)
data(unemp)
names(unemp)
## [1] "id"          "state_fips"  "county_fips" "name"        "year"       
## [6] "rate"        "county"      "state"

Question 9

For the unemp dataset,

  1. How many observations are included in the dataset? How many states are included in this dataset?

  2. How many counties are there in NY?

  3. Create a new data frame named unemp_NY for NY state. Produce the following output for the variable rate:

##   min_rate_NY max_rate_NY mean_rate_NY
## 1         5.6        13.3     8.009677

Solution 9.1

Solution 9.2

Solution 9.3

Question 10

Create a new data frame named sub_unemp, which includes data of 3 states: GA, TX and VA.

  1. How many observations are included in the new data frame?

  2. Produce Figure 10.1 presented below.

Figure 10.1

  3. Save Figure 10.1, produced in Q10.2, as a png file and include it in the zip file of your solution.

  4. Conduct a t-test to test the hypothesis that the mean unemployment rates in TX and VA are equal against a two-sided alternative. What is the value of the test statistic? How many observations were included in the analysis?

  5. Create a new R object that contains the lower and upper limits of the \(95\%\) confidence interval for the mean difference. DO NOT use xxx<-c(-0.2592,0.6928).

  6. Test whether the variances of the unemployment rate in the two states are equal.

  7. If needed, adjust your analysis in Q10.4 according to the result obtained in Q10.6.
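A hedged sketch of how test results can be extracted from a test object rather than typed in by hand, using two hypothetical toy vectors (`rate_a` and `rate_b` are made-up values, not the unemp data). For brevity the sketch runs the variance test first and lets its result decide whether the t-test pools the variances; the 5% threshold is an assumption chosen for the example.

```r
# Toy unemployment rates for two groups (hypothetical values)
rate_a <- c(6.1, 7.4, 5.8, 8.2, 6.9, 7.7, 6.4)
rate_b <- c(5.2, 6.0, 4.9, 6.8, 5.5, 6.3, 5.9, 5.1)

# F test for equality of the two variances
vt <- var.test(rate_a, rate_b)

# Two-sided t-test; pool the variances only if the F test
# does not reject equality at the 5% level
equal_var <- vt$p.value >= 0.05
tt <- t.test(rate_a, rate_b, alternative = "two.sided",
             var.equal = equal_var)

# Pull the results out of the test object instead of typing them in
t_stat <- unname(tt$statistic)
ci     <- as.numeric(tt$conf.int)   # lower and upper 95% limits

print(t_stat)
print(ci)
```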

Solution 10.1

Solution 10.2

Solution 10.3

Solution 10.4

Solution 10.5

Solution 10.6

Part 4: the pigs data

In this part, the questions are focused on the pigs dataset, which is part of the emmeans R package. To access the data you need to install the package. More information can be found via help(pigs). You can use the code below to access the data.

library(emmeans)
data(pigs)
names(pigs)
## [1] "source"  "percent" "conc"

Question 11

In this question, we use the pigs dataset without the missing values.

  1. Add to the pigs data a categorical variable percent_class that takes the value “high” when the percent is 18, “moderate” when the percent is 12 or 15, and “low” when the percent is 9. What is the proportion of the pigs for which percent_class is high? Produce the plot presented in Figure 11.1.

Figure 11.1

  2. For the new data, compute summary statistics (count, mean, sd) of the concentration of free plasma leucine (the variable conc) by the variable percent_class.

  3. Use the function aov() to fit a one-way ANOVA model in which the concentration of free plasma leucine (the variable conc) is the dependent variable and the protein percentage class in the diet (percent_class) is the factor.

  4. Print the ANOVA table for the model.

  5. Create a new R object, F.value, that contains the value of the F test statistic. DO NOT use F.value=1.858.

  6. Produce the diagnostic plots (normal QQ plot of the residuals and histogram of the residuals) presented in Figures 11.2 and 11.3 below.

Figure 11.2

Figure 11.3
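A minimal sketch of fitting a one-way ANOVA with `aov()` and extracting the F statistic from the resulting table, using hypothetical toy data (`response` and `group` are made-up and are not the pigs data). The extraction relies on the first element of the `summary()` result being a table with a column named "F value".

```r
# Toy data (hypothetical): a numeric response and a three-level factor
response <- c(20.1, 22.3, 19.8, 25.4, 27.0, 26.1, 30.2, 29.5, 31.8)
group    <- factor(rep(c("low", "moderate", "high"), each = 3),
                   levels = c("low", "moderate", "high"))

# One-way ANOVA: response modelled by the factor
fit <- aov(response ~ group)

# ANOVA table
tab <- summary(fit)
print(tab)

# Extract the F statistic from the table rather than typing it in
F.value <- tab[[1]][["F value"]][1]
print(F.value)

# Residuals, which can be passed to qqnorm() and hist() for diagnostics
res <- residuals(fit)
```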

Solution 11.1

Solution 11.2

Solution 11.3

Solution 11.4

Solution 11.5

Solution 11.6

Part 5: the fish data

In this part we use the data fish which is a part of the rrcov R package. To access the data you need to install the package. More information can be found in https://search.r-project.org/CRAN/refmans/rrcov/html/fish.html. You can use the code below to access the data.

library(rrcov)
data(fish)
names(fish)
## [1] "Weight"  "Length1" "Length2" "Length3" "Height"  "Width"   "Species"

Question 12

In this question we use the fish dataset WITH the missing values.

  1. Produce a frequency table of the number of fish for each species.

  2. Observation 14 has a missing value in variable Weight. Remove this observation from the data and create a new dataset, fish2. Use the new dataset to create a bar plot for the weight by species as shown in Figure 12.1.

Figure 12.1

  3. For the new dataset created in Q12.2, produce Figure 12.2, a scatter plot for Width vs. Weight by Species.

Figure 12.2

Solution 12.1

Solution 12.2

Solution 12.3

Question 13

In this question we use the new dataset defined in Q12.2.

  1. Use a for loop to calculate the correlation between Weight and Width for each species. This implies that for each step in the for loop another species will be selected and the correlation between Weight and Width will be calculated and printed.

  2. Produce the following output WITHOUT using a for loop. Note that the variable Correlation is the correlation between Weight and Width.

## # A tibble: 7 × 2
##   Species Correlation
##     <int>       <dbl>
## 1       1      0.344 
## 2       2      0.216 
## 3       3      0.449 
## 4       4      0.638 
## 5       5      0.648 
## 6       6      0.0430
## 7       7      0.563
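The two styles contrasted in this question, a for loop versus a loop-free computation, can be sketched on a hypothetical toy data frame (`d`, with made-up columns `g`, `x` and `y`). The loop-free version here uses base R `split()` and `sapply()`; the tibble output above suggests a dplyr pipeline with `group_by()` and `summarise()`, which is another possible route.

```r
# Toy data (hypothetical): two numeric variables and a grouping variable
d <- data.frame(
  g = rep(c(1, 2, 3), each = 4),
  x = c(1, 2, 3, 4,  2, 4, 5, 9,  1, 3, 2, 6),
  y = c(2, 4, 5, 9,  8, 3, 6, 1,  1, 2, 4, 5)
)

# Version 1: a for loop that selects one group per step
loop_cor <- numeric(0)
for (k in unique(d$g)) {
  part <- d[d$g == k, ]
  loop_cor[as.character(k)] <- cor(part$x, part$y)
  print(loop_cor[as.character(k)])
}

# Version 2: no loop; split the data by group and apply cor() to each piece
split_cor <- sapply(split(d, d$g), function(s) cor(s$x, s$y))
print(split_cor)
```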

Solution 13.1

Solution 13.2

Question 14

  1. For the dataset created in Q12.2 calculate the mean for the variables Weight, Length1 and Height by species and produce the data frame below.
## # A tibble: 7 × 4
##   Species avg_w avg_L1 avg_h
##     <int> <dbl>  <dbl> <dbl>
## 1       1 626     30.3  39.6
## 2       2 531     28.8  29.2
## 3       3 152.    20.6  26.7
## 4       4 155.    18.7  39.3
## 5       5  11.2   11.3  16.9
## 6       6 719.    42.5  15.8
## 7       7 382.    25.7  26.3
  2. Save the table that you produced in Q14.1 as an Excel file and add this Excel file to the zip file with your solutions.

Solution 14.1

Solution 14.2

Part 6: the msleep data

In this part we use the data msleep which is a part of the ggplot2 R package. To access the data you need to install the package. More information can be found in https://github.com/tidyverse/ggplot2/blob/main/data-raw/msleep.csv. You can use the code below to access the data.

library(dplyr)
data("msleep", package = "ggplot2")
head(msleep, 5)
## # A tibble: 5 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Question 15

  1. How many observations and variables are included in the data?

  2. Create a summary table of average sleep time (the variable sleep_total) for each level of the variable order, sorted in descending order of average sleep time.

Solution 15.1

Solution 15.2

Question 16

  1. For the msleep dataset, produce Figure 16.1 to visualize the relationship between the bodywt and brainwt variables. Add a regression line to the plot. Pay close attention to the colors used in Figure 16.1.

Figure 16.1

  2. Identify the outlying observations for which the body weight (the variable bodywt) is higher than 2000. Add the value of the body weight to the figure (inside the frame) as shown in Figure 16.2.

Figure 16.2

  3. Create a new data frame without the two outlying observations. Produce the scatterplot shown in Figure 16.3.

Figure 16.3

Solution 16.1

Solution 16.2

Solution 16.3

Part 7: the ChickWeight data

In this part we use the ChickWeight data, which is part of the R datasets package. The datasets package is installed with base R, so no additional installation is needed. More information can be found at https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/ChickWeight. You can use the code below to access the data.

data(ChickWeight)
head(ChickWeight)
## Grouped Data: weight ~ Time | Chick
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Question 17

  1. Write a function that receives a dataset and a variable as input and returns the mean, median, and standard deviation of the variable, rounded to 2 decimal places. Apply this function to the ChickWeight data and the variable weight.

  2. In the output below, both the numerical and the graphical output were produced using the user-defined function my.analysis(). The function receives as input: (1) a dataset name, (2) the column number of variable 1 (a numerical variable) and (3) the column number of variable 2 (a factor). Note that both variable 1 and variable 2 are part of the dataset. For the analysis in this question we use the ChickWeight dataset at time 0 (so only observations at time 0 are included). The output was produced using the following code: my.analysis(ChickWeight0,1,4).
    The dataset ChickWeight0 contains the observations that were measured at time 0. Based on the output below, your task in this question is to write the function my.analysis() and to produce the output using the code above. Note that your function should produce identical output.

## $`Summary statistics`
##   Diet Mean        SD  n
## 1    1 41.4 0.9947229 20
## 2    2 40.7 1.4944341 10
## 3    3 40.8 1.0327956 10
## 4    4 41.0 1.0540926 10
## 
## $`ANOVA table`
##                 Df Sum Sq Mean Sq F value Pr(>F)
## dataset[, var2]  3   4.32   1.440   1.132  0.346
## Residuals       46  58.50   1.272               
## 
## $`Sample mean and 95% CI`

## 
## $`Plot of the residuals`

Solution 17.1

Solution 17.2

Question 18

  1. Create the boxplot shown in Figure 18.1 to visualize the distribution of weights for each time point in the ChickWeight dataset. Color the boxplot based on the Time variable.

Figure 18.1

  2. Create an interactive boxplot, shown in Figure 18.2, to visualize the distribution of weights for each time point in the ChickWeight dataset. Color the boxplot based on the Time variable. DO NOT include this figure in the PDF document for your answers but ONLY in the HTML document.

Figure 18.2

Solution 18.1

Solution 18.2

Part 8: the quakes data

In this part we use the data quakes which is a part of the R datasets collection. Use help() to get more information about the data. More information can be found in https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/quakes. You can use the code below to access the data.

library(datasets)
data("quakes")
head(quakes)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

Question 19

  1. Create the 3D scatter plot presented in Figure 19.1 to illustrate the relationship between latitude (lat), longitude (long), and depth (depth) of earthquakes in the quakes dataset.

Figure 19.1

  2. Create an interactive 3D scatter plot, shown in Figure 19.2, to illustrate the relationship between latitude (lat), longitude (long), and depth (depth) of earthquakes in the quakes dataset. DO NOT include this figure in the PDF document for your answers but ONLY in the HTML document.

Figure 19.2

Solution 19.1

Solution 19.2

Question 20

  1. Calculate the mean Richter Magnitude (the variable mag) by the station (the variable stations).

  2. Create a new dataset which contains the observations from the top 25 stations with the highest Richter Magnitude. How many observations are included?

  3. For the new data, define a new variable that is equal to the ratio between Richter Magnitude and the depth, i.e., \[\text{ratio}=\frac{\text{Richter Magnitude}}{\text{depth}}.\]
    Sort the data according to the variable ratio.

  4. Print the three stations with the highest mean ratio.

  5. Create a new dataset for the stations with ratio higher than 0.099. For the new data, produce Figure 20.1.

Figure 20.1

  6. For the dataset created in Q20.2 and Q20.3, create a new categorical variable (mat_cat) that takes the value of 0 if the Richter Magnitude (the variable mag) is below the overall mean and 1 otherwise. Produce Figure 20.2.

Figure 20.2

Solution 20.1

Solution 20.2

Solution 20.3

Solution 20.4

Solution 20.5

Solution 20.6